Experiment: ResumableParser #994

Draft
byroot wants to merge 1 commit into ruby:master from byroot:resumable-parser

@byroot byroot commented Jun 5, 2026

Fix: #983

Numerous known issues TODO:

  • Performance: This new feature shouldn't significantly degrade classic JSON.parse. Right now twitter.json is 7-10% slower, that's not OK. We might need to duplicate the parsing loop if necessary.
    • Edit: removing write barriers in rvalue_stack_push brought most of the performance back.
    • Figure out a clean way to WB protect cResumableParser but not the embedded rvalue_stack.
  • object_start_cursor recorded in frame becomes invalid if the buffer string is reallocated or spilled.
  • The buffer needs to be shrunk sometimes.
  • A lot more testing is needed.
  • Unclear what to do with top level numbers (and perhaps true/false/null)
  • API is anything but final
    • I'd like to be able to "pop" the value, so we don't uselessly keep a reference on it.
    • Then methods need to be documented.
  • It would be worth trying to make json_parse_any exception free.
    • Right now EOF errors have been eliminated in favor of returning false.
    • We could try to do the same with syntax errors.
    • But then we need to rb_protect when calling back into Ruby or other unsafe APIs, so perhaps it's best to just accept it.
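
A small illustration of why top-level numbers are presumably awkward for a chunked parser (this reading of the TODO item is an assumption, not something stated in the PR): every prefix of a number is itself a complete JSON document, so the parser cannot tell from the buffer alone whether more digits will follow, whereas an array only becomes complete at its closing bracket. The helper name `complete_prefixes` is hypothetical, and the sketch uses the released `JSON.parse`, not the new parser.

```ruby
require "json"

# Hypothetical helper: returns the prefixes of +text+ that are
# complete JSON documents on their own.
def complete_prefixes(text)
  (1..text.size).map { |n| text[0, n] }.select do |prefix|
    begin
      JSON.parse(prefix)
      true
    rescue JSON::ParserError
      false
    end
  end
end

# Every prefix of a top-level number already parses successfully.
complete_prefixes("1234")   # => ["1", "12", "123", "1234"]

# A container is unambiguous: only the full text is a complete document.
complete_prefixes("[1234]") # => ["[1234]"]
```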

@byroot byroot force-pushed the resumable-parser branch 3 times, most recently from 497df78 to 6162ba8 Compare June 5, 2026 15:36
byroot added a commit that referenced this pull request Jun 5, 2026
Extracted from: #994

Modern compilers shouldn't have problem computing `strlen` at
compile time and generating the same code.
matzbot pushed a commit to ruby/ruby that referenced this pull request Jun 5, 2026
Extracted from: ruby/json#994

Modern compilers shouldn't have problem computing `strlen` at
compile time and generating the same code.

ruby/json@b07f74bd73
Comment on lines +229 to +231
    if (*handle) {
        RB_OBJ_WRITTEN(*handle, Qundef, value);
    }

This seems to account for most of the perf regression on twitter.json.

I think instead of making rvalue_stack WB protected, we could just not embed it, or just have a secondary non-protected object just to mark it.

kou commented Jun 5, 2026

Great!

Here are some random notes:

A MessagePack-like API (splitting the parsing API into appending new data, parsing the buffer, and getting the parsed data) is good.

MessagePack also uses feed but feed (a parser feeds data) may be a bit strange.

parser.consume(data), parser.refill(data), or something similar may be better. I used "consume" when I implemented a resumable parser in Apache Arrow (apache/arrow#6804), but that one uses a callback-style API where "consume" both fills a buffer and parses it. With this API, "consume" may be a bit strange too.
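
To make the three-step split concrete, here is a minimal pure-Ruby sketch of the API shape. It is a stand-in under stated assumptions, not the C implementation from this PR: the class name `SketchChunkedParser` is hypothetical, it only handles top-level arrays and objects, and it merely tracks nesting depth across chunks before handing the complete text to the released `JSON.parse`. The method names `feed`, `parse` and `value` mirror the ones registered in `Init_parser`.

```ruby
require "json"

# Hypothetical stand-in for the proposed API (NOT the parser from this PR).
# Scanning state survives between calls, so each byte is scanned only once.
class SketchChunkedParser
  def initialize
    @buffer = +""
    @scanned = 0        # offset of the first byte not yet scanned
    @depth = 0          # number of currently open arrays/objects
    @in_string = false
    @escaped = false
    @done = false
    @value = nil
  end

  # Step 1: append new data; no parsing happens here.
  def feed(chunk)
    @buffer << chunk
    self
  end

  # Step 2: resume scanning where the previous call stopped.
  # Returns true once a complete document is available, false otherwise.
  def parse
    return true if @done

    while @scanned < @buffer.bytesize
      byte = @buffer.getbyte(@scanned)
      @scanned += 1

      if @in_string
        if @escaped
          @escaped = false
        elsif byte == 0x5C  # backslash
          @escaped = true
        elsif byte == 0x22  # closing quote
          @in_string = false
        end
        next
      end

      case byte
      when 0x22 then @in_string = true   # opening quote
      when 0x5B, 0x7B then @depth += 1   # [ or {
      when 0x5D, 0x7D                    # ] or }
        @depth -= 1
        raise JSON::ParserError, "unexpected closing bracket" if @depth < 0
        if @depth == 0
          @value = JSON.parse(@buffer.byteslice(0, @scanned))
          @done = true
          return true
        end
      end
    end

    false
  end

  # Step 3: fetch the parsed document.
  def value
    raise ArgumentError, "no ready value" unless @done
    @value
  end
end

parser = SketchChunkedParser.new
parser.feed('{"a": [1, ')
parser.parse # => false, more data is needed
parser.feed('2]}')
parser.parse # => true
parser.value # => {"a"=>[1, 2]}
```

Keeping `feed` free of parsing work is what lets the caller decide when to pay for a parse attempt, which is the property the MessagePack-style split is meant to provide.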

How about having #value return the data parsed so far even when #parse returns false, something like the following?

diff --git a/ext/json/ext/parser/parser.c b/ext/json/ext/parser/parser.c
index 749d594..f0731ac 100644
--- a/ext/json/ext/parser/parser.c
+++ b/ext/json/ext/parser/parser.c
@@ -2254,9 +2254,8 @@ static VALUE cResumableParser_parse(VALUE self)
 static VALUE cResumableParser_value(VALUE self)
 {
     JSON_ResumableParser *parser = cResumableParser_get(self);
-    json_frame *frame = json_frame_stack_peek(&parser->frames);
 
-    if (frame->phase == JSON_PHASE_DONE) {
+    if (parser->state.value_stack->head > 0) {
         return *rvalue_stack_peek(parser->state.value_stack, 1);
     } else {
         rb_raise(rb_eArgError, "no ready value"); // TODO: Figure out the best exception and message

@tompng shared a use case for a resumable parser:

  • It may be useful for processing generative AI API responses
  • The response is returned as a stream
  • An application wants to display the response before it is complete

For example:

  1. Response: {"message": "This is a response. (not completed)
  2. App: Show This is a response
  3. Response: More messages. (not completed)
  4. App: Append More messages. to its view
  5. ...

The above diff doesn't satisfy this use case, but we can use it for simpler cases such as [1,. We could get [1] before we have the rest of the data (something like , 2]).

If we also provide an API that returns the not-yet-processed data, something like the following, we may be able to cover the generative AI API response use case:

diff --git a/ext/json/ext/parser/parser.c b/ext/json/ext/parser/parser.c
index 749d594..d39ba11 100644
--- a/ext/json/ext/parser/parser.c
+++ b/ext/json/ext/parser/parser.c
@@ -2263,6 +2263,15 @@ static VALUE cResumableParser_value(VALUE self)
     }
 }
 
+static VALUE cResumableParser_rest(VALUE self)
+{
+    JSON_ResumableParser *parser = cResumableParser_get(self);
+
+    return rb_str_substr(parser->buffer,
+                         parser->state.cursor - parser->state.start,
+                         parser->state.end - parser->state.cursor);
+}
+
 void Init_parser(void)
 {
 #ifdef HAVE_RB_EXT_RACTOR_SAFE
@@ -2289,6 +2298,7 @@ void Init_parser(void)
     rb_define_method(cResumableParser, "feed", cResumableParser_feed, 1);
     rb_define_method(cResumableParser, "parse", cResumableParser_parse, 0);
     rb_define_method(cResumableParser, "value", cResumableParser_value, 0);
+    rb_define_method(cResumableParser, "rest", cResumableParser_rest, 0);
 
     CNaN = rb_const_get(mJSON, rb_intern("NaN"));
     rb_gc_register_mark_object(CNaN);
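
The offset arithmetic in the diff above can be modelled in plain Ruby. This is a sketch under assumptions: `unconsumed` is a hypothetical helper, and the cursor is taken to be a byte offset into the buffer. One point that may be worth double-checking in the C version: as far as can be told from Ruby's C API, `rb_str_substr` counts characters while the cursors are byte pointers, so the two can disagree once the buffer contains multibyte characters. The example shows the difference using `String#byteslice` versus character-based slicing.

```ruby
# Hypothetical helper, not part of the PR: returns the bytes of +buffer+
# that come after the byte offset +cursor+.
def unconsumed(buffer, cursor)
  buffer.byteslice(cursor, buffer.bytesize - cursor)
end

first_doc = "[\"\u00e9\", 1]"   # 8 characters, but 9 bytes in UTF-8
buffer    = first_doc + " [2"   # a complete document, then the start of the next
cursor    = first_doc.bytesize  # assumed cursor: just past the first document

unconsumed(buffer, cursor)      # => " [2"  (byte-based, the intended rest)
buffer[cursor, buffer.size]     # => "[2"   (character-based, skips one too many)
```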

byroot commented Jun 6, 2026

> but feed (a parser feeds data) may be a bit strange.

Agreed. As I was working on this, I was thinking of simply renaming it to <<.

> How about #value returns parsing data even when #parse returns false something like the following?

So your snippet wouldn't work, because the Hash and Array are only built once complete.

E.g. with [1, 2, 3, the first element on the value_stack is 1, not [1, 2, 3].

However, we could indeed build the partial object on demand. It's an interesting feature and I'll see about adding it, but I think it should either be a different method or an optional parameter, e.g. parser.value(partial: true), to avoid confusion.
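
A rough model of what building the partial object on demand could look like, written in Ruby for readability. Everything here is an assumption made for illustration: the `Frame` struct, its `start` field (the index in the value stack where an open container's items begin) and the `partial_value` helper are hypothetical and do not correspond to the structs in parser.c. The idea is to fold the open frames from the innermost outwards, without touching the real stack.

```ruby
# Hypothetical model of an open container; not the json_frame struct from the PR.
# start: index in the value stack where this container's items begin.
Frame = Struct.new(:type, :start)

# Builds a snapshot of the document parsed so far from the flat value stack
# and the list of still-open frames (outermost first).
def partial_value(value_stack, frames)
  stack = value_stack.dup # work on a copy so the parser state is untouched
  frames.reverse_each do |frame|
    items = stack.slice!(frame.start, stack.size) || []
    built =
      if frame.type == :array
        items
      else
        items.pop if items.size.odd? # drop a key whose value has not arrived yet
        items.each_slice(2).to_h
      end
    stack.push(built)
  end
  stack.last
end

# After feeding `[1, 2, 3,` the stack holds the three elements, not an Array.
partial_value([1, 2, 3], [Frame.new(:array, 0)])
# => [1, 2, 3]

# After feeding `{"a": [1, 2` both containers are still open.
partial_value(["a", 1, 2], [Frame.new(:object, 0), Frame.new(:array, 1)])
# => {"a"=>[1, 2]}
```

Building a fresh snapshot on each call keeps the regular parsing path unchanged, at the cost of re-allocating the containers every time the partial value is requested.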

> an API that returns not processed data something like the following,

Interesting, that wouldn't be hard indeed.

@byroot byroot force-pushed the resumable-parser branch from 6162ba8 to 85983e1 Compare June 6, 2026 06:37
Fix: ruby#983

Numerous known issues TODO:

  - `object_start_cursor` recorded in frame becomes invalid if the buffer string
    is reallocated or spilled.
  - The buffer need to be shrunk sometimes.
  - Lot more testing needed.
  - Unclear what to do with top level numbers (and perhaps true/false/null)
  - API is all but final
    - I'd like to be able to "pop" the value, so we don't uselessly keep a
      reference on it.
    - Then methods need to be documented.
  - It would worth trying to make `json_parse_any` exception free.
    - Right now EOF errors have been eliminated in favor of returning
      `false`.
    - We could try to do the same with syntax errors.
    - But then we need to `rb_protect` when calling back into Ruby
      or other unsafe APIs, so perhaps it's best to just accept it.
@byroot byroot force-pushed the resumable-parser branch from 85983e1 to 77f4ff0 Compare June 6, 2026 08:17
Development

Successfully merging this pull request may close these issues.

Add support for parsing chunked data

2 participants